class: center, middle, inverse, title-slide # Chicago Taxi Data Analysis ## SCV Group ### ### 27 Oct 2021 --- class: title-slide,middle background-image: url("pikapika.svg"), url("assets/taxi.jpeg") background-position: 10% 90%, 100% 50% background-size: 100px, 100% 100% <!-- background-color: #0148A4 --> ## .black[Hi! During your Journey,] ## .black[Drive safe &] ## .white[Scroll each page please] ## .black[Enjoy! :-)] --- ## Visualise Data .scroll-output[ First of all, we do the basic analysis on the dataset and look at each variable. For the continues variables, they were visualised using ggpair function in GGally package, a powerful tool that gives scatterplot for each pair of the variables and a histogram for each single variable, as well as the correlation between each pair of variables. For instance, we expect to see a correlation between trip_total and trip_miles. In the scatter plot, we can spot an almost linear line and also the outliners. In the correlation cell, we get a correlation coefficient of 0.467, which indicates a moderate positive association. <img src="./assets/1.png" width="5120" /> To further visualise the variable trip_total which is of our main interest, we use a geom_histogram in ggplot2 and carefully select the binwidth. <img src="./assets/4.png" width="1867" /> ## Boxplots: Revenue vs. Zone With regards to discrete variables, we have 77 zones in Chicago and it is too much to plot out. However, since we are interested in the zones the generate most revenue, we plot out the top 10 zones. To visualise the distribution of the data as well as give a comprehensive comparison among zones, we use boxplots in parallel. From the plots below, it is clear that zone 76 and zone 56 generate the most revenue, whereas they are also the two zones where the average avenue reaches the highest, hence they might be where drivers like most. While plotting, we found extreme outliners with trip_total > 200. To better visualise the data, we set yscale to exclude them. <img src="./assets/5.png" width="1867" /> Further, we will prefer using median over mean because we noticed that most of the distributions of revenue are skewed. <img src="./assets/6.png" width="5120" /> As for the dropoff area, it is obvious that zone 76 produced most revenue. <img src="./assets/7.png" width="1867" /> However, we prefer using pickup_community_area over dropoff_comunity_area to answer our question since we see the pickup zone as the location where new revenue are generated and are of driver’s interest. ] --- ## Total Revenue vs. Zone Analysis by Maps .scroll-output[
After looking at the revenue map, we wonder why some of places generate more revenue than others? To further explore the reason, we introduce outside dataset about the amenities as well as the population data from Wikipedia. ## Population We think population might be one of the main factors that affects the total avenue of each zone. Hence we extract the population data from Wikipedia page and join it into our original dataset. We apply logarithmic function on population to make the map more sensible. After comparing the two maps, we think the population might be one of the exploratory variables to the total revenue.
## Load Open Street Map Data Apart from population, we think the location of certain city amenities might be another main factor that affects the total avenue of each zone. Hence we extract amenity information including theatre, restaurant and hospital from Open Street Map Data (osmdata) package and plot it out using the same package. Due to the limit of the dataset, we cannot get all the places where people are likely to take a taxi. But we can still identify the places with most clustered orange dots are popular area for taking taxi, also they might be in the centre of Chicago. <img src="./assets/2.png" width="1867" /> To make it easier to compare, we plot out all roads for cars and add the zone border into the map, as well as the amenities on top of them. <img src="./assets/3.png" width="1867" /> Now, it becomes clearer that the reason for higher revenue in certain zone might be related to the density of certain amenities. The only exception is O’Hare, where not many amenities are gathered but a hot spot for taxis. It turned out to be that O’Hare is the location of Chicago O'Hare International Airport, which explains why it attracts most of the taxis and produces such a high revenue. <!-- --> From the animation, we can spot ... ] --- ## Shiny App: Trip data by pickup/dropoff areas of Chicago .orange[By Thomas] .scroll-output[ ### by pickup areas of Chicago In this section we will comment an interactive map of Chicago that displays the _sum_, _mean_ and _max_ for the time of the day we want from 1 A.M. to 12 A.M. for each pickup areas of the city. ```r shinyApp(ui = ui, server = server) ``` .red[Please click on this link for the shiny app: xxx.xxxx.xxx] For the summation of the revenue per area, we see that for most of the time it is the city center (_Near West Side_, _Loop_ and _Near North Side_) and the airports (_O'Hare_ and _Garfield Ridge_) that have higher values, except during the night for which the values go down for the airports. This could be due to the fact that there is less flights coming to and going from airports at night. For the city center, it should be due to the fact that in every city there is more activity in the center of it. For the mean of the revenue per area, we can see that it is quite homogeneous on the border of Michigan Lake. But airports are the areas with some of the higher mean, as before, during the day. For the max of the revenue per area, we see that the maximal value attained for each time of the day comes from city center, particularly from _Near North Side_ and _Loop_, values that are more than 5000$ most of the time. This could be due to the fact that people that pick a taxi from the city center have a long journey to do. ### by drop-off areas of Chicago This section is similar to the previous one, but instead of looking at pickup areas, we look at drop-off areas and end time of each trip. As before, we can take a look at each hour of the day from 1 A.M. to 12 A.M. ```r shinyApp(ui = ui, server = server2) ``` .red[Please click on this link for the shiny app: xxx.xxxx.xxx] For the summation part, we can see quit the same as for the pickup map but the highest values are less concentrated in the city center and more scattered around it. We have the same conclusion for the airports. For the mean case, the values are a lot more homogeneous than for pickups and dispersed over all the city. For the maximal values, we see that the higher values appearing are from the city center to _Lake View_, an area in the north. As before, airports have high values. To conclude, this was a brief analyze of the basis data in order to have an idea of the situation for both pickup and drop-off. We see that it would be better to analyze deeper the pickup case if we wish to answer our questions. . ] --- ## Time Map .scroll-output[
] <!-- .footnote[ --> <!-- By Yiren --> <!-- ] --> --- ## Why is it the case? .orange[By Yiren] .scroll-output[ To find out the reason why specific zones generate most revenue as well as why some zones give less pick up time, we join several other dataset downloaded from Chicago governmental website and look for correlation among the factors. Initially when we plot the correlation out, we can clearly see outliners that greatly affect the correlation coefficient, hence we drop them and plot it again. For this plot, .orange[you can click on each cell and see a correlation between the variables]. Our dependent variable `trip_total` is placed at the last column, it has negative correlations with `percent_aged_under_18_or_over_64`, `percent_aged_25_without_high_school_diploma`, `percent_of_housing_crowded`, positive correlation with `per_capita_income`, `population` and `density`. As for `avg_pickup_time` placed at the second last column, it has a negative correlation with `percent_households_below_poverty` and `percent_aged_16_unemployed`.
## Multiple Regression by Backward Model Selection <!-- ```{r, echo = FALSE} --> <!-- load("allvars_shape.RData") --> <!-- # load("test.RData") --> <!-- # tmap_mode("view") --> <!-- # df1 <- df1 %>% drop_na() %>% rename(population = Population.x) %>% st_as_sf() --> <!-- colnames(df1) --> <!-- p1 = tm_shape(df1) + --> <!-- tm_polygons(col = "area") + --> <!-- tm_layout(title = "Chicago Area", title.size = 1, title.position = c("right", "TOP")) + --> <!-- tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"),) --> <!-- p2 = tm_shape(df1) + --> <!-- tm_polygons(col = "per_capita_income") + --> <!-- tm_layout(title = "Chicago per_capita_income", title.size = 1, title.position = c("right", "TOP")) +tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"),legend.title.size = 1, --> <!-- legend.text.size = 0.6, --> <!-- legend.bg.color = "white", --> <!-- # legend.digits = 5, --> <!-- legend.bg.alpha = 1) --> <!-- p3 = tm_shape(df1) + --> <!-- tm_polygons(col = "population") + --> <!-- tm_layout(title = "Chicago population", title.size = 1, title.position = c("right", "TOP")) + --> <!-- tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"), legend.title.size = 1, --> <!-- legend.text.size = 0.6, --> <!-- legend.bg.color = "white", --> <!-- # legend.digits = 5, --> <!-- legend.bg.alpha = 1) --> <!-- p4 = tm_shape(df1) + --> <!-- tm_polygons(col = "percent_of_housing_crowded") + --> <!-- tm_layout(title = "Chicago percent_of_housing_crowded", title.size = 1, title.position = c("right", "TOP")) + --> <!-- tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"),) --> <!-- p5 = tm_shape(df1) + --> <!-- tm_polygons(col = "percent_aged_16_unemployed") + --> <!-- tm_layout(title = "Chicago percent_aged_16_unemployed", title.size = 1, title.position = c("right", "TOP")) +tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"),legend.title.size = 1, --> <!-- legend.text.size = 0.6, --> <!-- legend.bg.color = "white", --> <!-- # legend.digits = 5, --> <!-- legend.bg.alpha = 1) --> <!-- p6 = tm_shape(df1) + --> <!-- tm_polygons(col = "percent_aged_25_without_high_school_diploma") + --> <!-- tm_layout(title = "Chicago percent_aged_25_without_high_school_diploma", title.size = 1, title.position = c("right", "TOP")) + --> <!-- tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"), legend.title.size = 1, --> <!-- legend.text.size = 0.6, --> <!-- legend.bg.color = "white", --> <!-- # legend.digits = 5, --> <!-- legend.bg.alpha = 1) --> <!-- p7 = tm_shape(df1) + --> <!-- tm_polygons(col = "percent_aged_under_18_or_over_64") + --> <!-- tm_layout(title = "Chicago percent_aged_under_18_or_over_64", title.size = 1, title.position = c("right", "TOP")) + --> <!-- tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"),) --> <!-- p8 = tm_shape(df1) + --> <!-- tm_polygons(col = "hardship_index") + --> <!-- tm_layout(title = "Chicago hardship_index", title.size = 1, title.position = c("right", "TOP")) +tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"),legend.title.size = 1, --> <!-- legend.text.size = 0.6, --> <!-- legend.bg.color = "white", --> <!-- # legend.digits = 5, --> <!-- legend.bg.alpha = 1) --> <!-- # p9 = tm_shape(df1) + --> <!-- # tm_polygons(col = "percent_aged_16_unemployed") + --> <!-- # tm_layout(title = "Chicago percent_aged_16_unemployed", title.size = 1, title.position = c("right", "TOP")) + --> <!-- # tm_layout(frame.lwd = 3, legend.position = c("left", "bottom"), legend.title.size = 1, --> <!-- # legend.text.size = 0.6, --> <!-- # legend.bg.color = "white", --> <!-- # # legend.digits = 5, --> <!-- # legend.bg.alpha = 1) --> <!-- # tmap_mode("plot") --> <!-- ``` --> <!-- ```{r, echo = FALSE, warning = FALSE, message = FALSE, fig.height=3, figwidth = 5} --> <!-- t1<- tmap_arrange(p1,p2,p3, ncol = 3) --> <!-- t2<- tmap_arrange(p4,p5,p6, ncol = 3) --> <!-- t3<- tmap_arrange(p7, p8, ncol = 2) --> <!-- ``` --> <img src="./assets/8.png" width="1867" /><img src="./assets/9.png" width="1867" /><img src="./assets/10.png" width="2484" /> First, we perform multiple regression for trip_total. To find the most appropriate model, we start with the model containing all possible explanatory variables. And progressively, we remove the least informative variable and stop until all variables in the current model are significant when `\(\alpha = 0.05\)`. We do a backward search using Akaike Information Criterion (AIC) and get the following model. ``` ## ## Call: ## lm(formula = trip_total ~ population + percent_of_housing_crowded + ## percent_aged_16_unemployed + percent_aged_under_18_or_over_64, ## data = df) ## ## Coefficients: ## (Intercept) population ## 700722.131 3.136 ## percent_of_housing_crowded percent_aged_16_unemployed ## -27518.550 19959.164 ## percent_aged_under_18_or_over_64 ## -19294.557 ``` ``` ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 700722.131 212619.909 3.296 0.002 ## population 3.136 1.379 2.273 0.026 ## percent_of_housing_crowded -27518.550 6744.633 -4.080 0.000 ## percent_aged_16_unemployed 19959.164 4129.041 4.834 0.000 ## percent_aged_under_18_or_over_64 -19294.557 5914.440 -3.262 0.002 ``` Therefore, we get `$${\text{trip_total}} = 700722.131 + 3.136 \times \text{population} -27518.550 \times \text{percent_of_housing_crowded} + 19959.164 \times \text{percent_aged_16_unemployed} -19294.557 \times \text{percent_aged_under_18_or_over_64}$$` Since the P values for each variable and intercept are all less than 0.05, we keep all of them in our formula for calculating trip_total. ``` ## # A tibble: 1 x 3 ## r.squared adj.r.squared p.value ## <dbl> <dbl> <dbl> ## 1 0.45 0.42 0 ``` We get the adjusted r squared to be 0.42, which means 42% of the variance for the dependent variable `trip_total` can be explained by our independent variables, and hence it is a relatively good model. Then, we apply the same process on the `avg_pickup_time`. ``` ## ## Call: ## lm(formula = avg_pickup_time ~ percent_of_housing_crowded + percent_aged_25_without_high_school_diploma + ## percent_households_below_poverty + percent_aged_under_18_or_over_64, ## data = df) ## ## Coefficients: ## (Intercept) ## 15.2134 ## percent_of_housing_crowded ## 0.6735 ## percent_aged_25_without_high_school_diploma ## -0.1839 ## percent_households_below_poverty ## -0.1646 ## percent_aged_under_18_or_over_64 ## 0.1300 ``` ``` ## Estimate Std. Error t value ## (Intercept) 15.213 3.098 4.910 ## percent_of_housing_crowded 0.674 0.298 2.262 ## percent_aged_25_without_high_school_diploma -0.184 0.101 -1.813 ## percent_households_below_poverty -0.165 0.051 -3.199 ## percent_aged_under_18_or_over_64 0.130 0.093 1.392 ## Pr(>|t|) ## (Intercept) 0.000 ## percent_of_housing_crowded 0.027 ## percent_aged_25_without_high_school_diploma 0.074 ## percent_households_below_poverty 0.002 ## percent_aged_under_18_or_over_64 0.169 ``` `$${\text{avg_pickup_time}} = 15.2134 + 0.6735 \times \text{percent_of_housing_crowded} -0.1839 \times \text{percent_aged_25_without_high_school_diploma} + -0.1646 \times \text{percent_households_below_poverty} -0.1300 \times \text{percent_aged_under_18_or_over_64}$$` ``` ## # A tibble: 1 x 3 ## r.squared adj.r.squared p.value ## <dbl> <dbl> <dbl> ## 1 0.21 0.16 0 ``` We get a nice formula for `avg_pickup_time`, however, the adjusted R squared is just 0.16, which indicates that it is not a good model and we need to find other factors to better explain `avg_pickup_time`, which also suggests a direction for our future work. ] <!-- --- --> <!-- ## Pre Analysis Observations --> <!-- There are ten variables in this dataset, as shown by the summary statistics: --> <!-- - Mean birth weight is approximately 2.9kg, about 600 grams below the expected mean of 3.5kg. This is within the normal range of weights of 2.5kg- --> <!-- .scroll-box-14[ --> <!-- ```{r, echo = FALSE, warning=FALSE} --> <!-- library(MASS) --> <!-- library(janitor) --> <!-- library(skimr) --> <!-- library(tidyr) --> <!-- library(readr) --> <!-- library(ggfortify) --> <!-- library(dplyr) --> <!-- library(tidyverse) --> <!-- library(ggplot2) --> <!-- library(visdat) --> <!-- library(sjPlot) --> <!-- library(leaps) --> <!-- library(caret) --> <!-- library(regclass) --> <!-- data = birthwt --> <!-- summary(data) --> <!-- ``` --> <!-- ] --> <!-- --- --> <!-- ## birthwt in R --> <!-- Here a glimpse of the data set we are going to use in the further research. --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE} --> <!-- data = birthwt --> <!-- glimpse(data) --> <!-- ``` --> <!-- --- --> <!-- ### data handling --> <!-- # data --> <!-- .scroll-box-14[ --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE} --> <!-- data = MASS::birthwt --> <!-- glimpse(data) --> <!-- summary(data) --> <!-- head(data) --> <!-- # data %>% skimr::skim() --> <!-- # visdat::vis_miss(data) --> <!-- data = data %>% mutate( --> <!-- low = as.factor(low), --> <!-- race = as.character(race), --> <!-- smoke = as.character(smoke), --> <!-- ui = as.character(ui), --> <!-- ht = as.character(ht) --> <!-- ) --> <!-- data_without_low = data %>% dplyr::select(-low) --> <!-- data_with_low = data --> <!-- ``` --> <!-- ] --> <!-- --- --> <!-- class: title-slide,middle --> <!-- background-image: url("assets/pika2.svg"), url("assets/title-image3.jpg") --> <!-- background-position: 10% 90%, 100% 50% --> <!-- background-size: 160px, 50% 100% --> <!-- background-color: #0148A4 --> <!-- # .text-shadow[.white[Assumption Checking]] --> <!-- ## .white[Yiren Cao] --> <!-- --- --> <!-- ## Assumption Checking - Linearity, Homoskedasticity --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE, fig.width=16, fig.height=15} --> <!-- data_without_low_for_correlation = data_without_low --> <!-- data_without_low_for_correlation[] <- lapply(data_without_low_for_correlation, --> <!-- function(x) as.numeric(as.character(x))) --> <!-- {{qtlcharts::iplotCorr(data_without_low_for_correlation)}} # non-multicollinearity --> <!-- ``` --> <!-- --- --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE, fig.width=12, fig.height=8} --> <!-- GGally::ggpairs(data_without_low) --> <!-- ``` --> <!-- --- --> <!-- ## Assumption Checking - Linearity, Homoskedasticity, Normality --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE, fig.width=12, fig.height=6} --> <!-- lm1_without_low = lm(bwt ~ ., data = data_without_low) --> <!-- autoplot(lm1_without_low, which = 1:2) # linearity + normality --> <!-- ``` --> <!-- ```{r, echo = FALSE} --> <!-- icon::fa("bell") --> <!-- ``` --> <!-- - In addition, we can assume the Independence! --> <!-- --- --> <!-- ## Assumption Checking - No Multicollinearity --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE, fig.width=10, fig.height=5} --> <!-- M1 = lm1_without_low # Full model --> <!-- vif_values <- regclass::VIF(M1) --> <!-- vif_values --> <!-- ``` --> <!-- .blockquote[ --> <!-- ###
Variance Inflation Factor (VIF) --> <!-- - when VIF is equal to 1, the independent variables are not correlated to the one another --> <!-- ] --> <!-- --- --> <!-- class: inverse, center, middle --> <!-- # Another model: with `low` predictor --> <!-- --- --> <!-- ## Assumption Checking - Linearity, Homoskedasticity, Normality --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE, fig.width=12, fig.height=6} --> <!-- reg1 = lm(bwt ~ ., data = data_with_low) --> <!-- autoplot(reg1, which = 1:2) --> <!-- ``` --> <!-- ```{r, echo = FALSE} --> <!-- icon::fa("bell") --> <!-- ``` --> <!-- - In addition, we can assume the Independence! --> <!-- --- --> <!-- ## Assumption Checking - No Multicollinearity --> <!-- ```{r, message=FALSE, warning=FALSE, fig.width=12, fig.height=5} --> <!-- vif_values_with_low <- regclass::VIF(reg1) --> <!-- vif_values_with_low --> <!-- ``` --> <!-- --- --> <!-- Please add analysis here --> <!-- --- --> <!-- class: title-slide,middle --> <!-- background-image: url("assets/pika2.svg"), url("assets/title-image2.jpg") --> <!-- background-position: 10% 90%, 100% 50% --> <!-- background-size: 160px, 50% 100% --> <!-- background-color: #0148A4 --> <!-- # .text-shadow[.white[Model Selection]] --> <!-- # .text-shadow[.white[Assumption Re-check]] --> <!-- ## .white[Yiren Cao] --> <!-- --- --> <!-- ## Model Selection --> <!-- ```{r, echo = FALSE, message=FALSE, warning=FALSE, fig.width=5, fig.height=5} --> <!-- cv_with_low = train( --> <!-- bwt ~ low + ui+ smoke + race, data, --> <!-- method = "lm", --> <!-- trControl = trainControl( --> <!-- method = "cv", number = 10, --> <!-- verboseIter = FALSE --> <!-- ) --> <!-- ) --> <!-- cv_without_low = train( --> <!-- bwt ~ lwt + race + smoke + ht+ ui, data_without_low, --> <!-- method = "lm", --> <!-- trControl = trainControl( --> <!-- method = "cv", number = 10, --> <!-- verboseIter = FALSE --> <!-- ) --> <!-- ) --> <!-- # cv_without_low --> <!-- # cv_with_low --> <!-- results = resamples(list(without_low = cv_without_low, with_low = cv_with_low)) --> <!-- ggplot(results, metric = "RMSE") + labs(y = "RMSE") --> <!-- ggplot(results, metric = "MAE") + labs(y = "MAE") --> <!-- ggplot(results, metric = "Rsquared") + labs(y = "R squared") --> <!-- icon::fa("spinner", size = 2, animate = "spin") --> <!-- ``` --> <!-- --- --> <!-- ## Assumption Re-check --> <!-- ```{r, echo = FALSE, fig.width=10, fig.height=5} --> <!-- step.back.aic = step(M1, direction = "backward", trace = FALSE) --> <!-- autoplot(step.back.aic, which = 1:2) --> <!-- vif_values_without_low <- regclass::VIF(step.back.aic) --> <!-- vif_values_without_low --> <!-- ``` -->